How Popular is Your Paper? An Empirical Study of the Citation Distribution

Numerical data for the distribution of citations are examined for: (i) papers published in 1981 in
journals which are catalogued by the Institute for Scientific Information (783,339 papers) and (ii) 20
years of publications in Physical Review D, vols. 11-50 (24,296 papers). A Zipf plot of the number
of citations to a given paper versus its citation rank appears to be consistent with a power-law
dependence for leading rank papers, with exponent close to $-1/2$. This, in turn, suggests that the
number of papers with $x$ citations, $N(x)$, has a large-$x$ power law decay $N(x) \sim x^{-\alpha}$,
with $\alpha \approx 3$.

PACS Numbers: 02.50.+S, 01.75.+m, 89.90.+n

In this letter, I consider a question which is of relevance
to those for whom scientific publication is a primary
means of scholarly communication. Namely, how
often is a paper cited? While the average or total number
of citations is often quoted anecdotally and tabulations
of highly-cited papers exist [1, 2], the focus of
this work is on the more fundamental distribution of citations,
namely, the number of papers which have been
cited a total of $x$ times, $N(x)$. In spite of the fact that
many academics are obliged to document their citations 
for merit-based considerations, there have been only a 
few scientific investigations on quantifying citations or related
measures of scientific productivity. In a 1957 study
based on the publication record of the scientific research
staff at Brookhaven National Laboratory, Shockley [3]
claimed that the scientific publication rate is described
by a log-normal distribution. Much more recently, Laherrere
and Sornette [4] have presented numerical evidence,
based on data of the 1120 most-cited physicists
from 1981 through June 1997, that the citation distribution
of individual authors has a stretched exponential
form, $N(x) \propto \exp[-(x/x_0)^{\beta}]$ with $\beta \approx 0.3$. Both papers
give qualitative justifications for their assertions which 
are based on plausible general principles; however, these 
arguments do not provide specific numerical predictions. 

Here, the citation distribution of scientific publications
based on two relatively large data sets is investigated [5].
One (ISI) is the citation distribution of 783,339 papers
(with 6,716,198 citations) published in 1981 and cited between
1981 and June 1997 that have been cataloged by the
Institute for Scientific Information. The second (PRD)
is the citation distribution, as of June 1997, of the 24,296
papers cited at least once (with 351,872 citations) which
were published in volumes 11 through 50 of Physical Review D,
1975-1994. Unlike Ref. [4], the focus here is on
citations of publications rather than citations of specific 
authors. A primary reason for this emphasis is that the 
publication citation count reflects on the publication it- 
self, while the author citation count reflects ancillary fea- 
tures, such as the total number of author publications, 
the quality of each of these publications, and co-author 
attributes. Additionally, only most-cited author data is 
currently available; this permits reconstruction of just
the large-citation tail of the citation distribution.

FIG. 1. (a) Citation distribution from the 783,339 papers
in the ISI data set (△) and the 24,296 papers in the PRD
data set (○) on a double logarithmic scale. For visual reference,
a straight line of slope $-3$ is also shown. (b) Same as
(a), except on a semi-logarithmic scale. The solid curves are
the best fits to the data for $x < 200$ (PRD) and $x < 500$ (ISI).



The main result of this study is that the asymptotic
tail of the citation distribution appears to be described
by a power law, $N(x) \sim x^{-\alpha}$, with $\alpha \approx 3$. This conclusion
is reached indirectly by means of a Zipf plot (to
be defined and discussed below), however, because Fig. 1
indicates that the citation distribution is not described
by a single function over the whole range of $x$.

Since the distribution curves downward on a double 
logarithmic scale and upward on a semi-logarithmic scale 
(Figs. 1(a) and (b) respectively), a natural first hypothesis
is that this distribution is a stretched exponential,
$N(x) \propto \exp[-(x/x_0)^{\beta}]$. Visually, the numerical data fit
this form fairly well for $x < 200$ (PRD) and $x < 500$ (ISI)
as indicated in Fig. 1(b), with best fit values $\beta \approx 0.39$
(PRD) and $\beta \approx 0.44$ (ISI). However, the stretched exponential
is unsuitable to describe the large-$x$ data. Here,
data points are widely scattered, reflecting the paucity of 
well-cited papers. For example, in the ISI data, only 64 
out of 783,339 papers are cited more than 1000 times, 282 
papers are cited more than 500 times, and 2103 papers 
are cited more than 200 times, with the most-cited paper 
having 8907 citations. Such a sparsely populated tail is 
not amenable to being directly fit by a smooth function. 
(Amusingly (or soberingly) 633,391 articles in the ISI set 
are cited 10 times or less and 368,110 are uncited.) 
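
The text does not say how the best-fit curves of Fig. 1(b) were obtained. One possible procedure, sketched here under stated assumptions, uses the fact that for fixed $\beta$ the form $\ln N(x) = c - (x/x_0)^{\beta}$ is linear in $x^{\beta}$, so a scan over trial values of $\beta$ combined with an ordinary linear regression is enough. The function name `fit_stretched_exponential`, the grid of trial exponents, and the synthetic input below are illustrative choices and are not taken from the ISI or PRD data.

```python
import math

def fit_stretched_exponential(xs, ns, betas):
    """Fit ln N(x) = c - (x/x0)**beta by scanning beta (illustrative sketch).

    For each trial beta, ln N is regressed linearly on t = x**beta; the beta
    with the smallest sum of squared residuals is returned with x0 and c.
    """
    ys = [math.log(n) for n in ns]
    best = None
    for beta in betas:
        ts = [x ** beta for x in xs]
        m = len(ts)
        t_mean = sum(ts) / m
        y_mean = sum(ys) / m
        stt = sum((t - t_mean) ** 2 for t in ts)
        sty = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
        slope = sty / stt
        intercept = y_mean - slope * t_mean
        sse = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, ys))
        # A decaying distribution requires a negative slope, -1/x0**beta.
        if slope < 0 and (best is None or sse < best[0]):
            x0 = (-1.0 / slope) ** (1.0 / beta)
            best = (sse, beta, x0, intercept)
    _, beta, x0, intercept = best
    return beta, x0, intercept

# Synthetic check with hypothetical parameters (beta = 0.4, x0 = 5).
xs = list(range(1, 201))
ns = [1000.0 * math.exp(-((x / 5.0) ** 0.4)) for x in xs]
betas = [i / 100 for i in range(20, 101)]
beta_fit, x0_fit, _ = fit_stretched_exponential(xs, ns, betas)
```

With real citation counts, values of $x$ for which $N(x) = 0$ would have to be skipped or binned before taking logarithms, and a regression on $\ln N$ weights the sparse tail as heavily as the well-populated small-$x$ region, which is one reason a direct fit of the tail is unreliable.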

Another test to determine the functional form of $N(x)$
is to compare numerical values for the moments of the
citation distribution

$$\langle x^k \rangle = \frac{\int x^k N(x)\,dx}{\int N(x)\,dx}, \tag{1}$$

with those obtained by assuming a given form for
$N(x)$. For example, if the citation distribution is
a stretched exponential, then the dimensionless ratios
$M_k = \langle x^k\rangle/\langle x\rangle^k = \Gamma\!\left(\frac{k+1}{\beta}\right)\Gamma\!\left(\frac{1}{\beta}\right)^{k-1}\big/\,\Gamma\!\left(\frac{2}{\beta}\right)^{k}$, where $\Gamma(x)$
is the gamma function. Notice that the scale factor $x_0$ in
the exponential cancels. For each $k$, an estimate for $\beta$ can
be inferred by matching the value of $M_k$ obtained from
the above gamma function formula with the corresponding
numerical data. For both the ISI and PRD data,
the corresponding estimates for $\beta$ for $k = 2, 3, \ldots, 6$ depend
weakly but non-systematically on $k$, and further do
not match the values for $\beta$ obtained from a least-squares
fit to a stretched exponential (Fig. 1(b)). Similarly, the
numerical data for $\langle x^k\rangle$ also do not match a power-law
form for the citation distribution, $N(x) \sim x^{-\alpha}$. These
results provide evidence that the citation distribution is
not described by a single function over the entire range
of citation count.
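
The matching of moment ratios to the gamma function formula can be sketched numerically. The helper names `log_moment_ratio`, `moment_ratio` and `beta_from_ratio` are hypothetical, and solving by bisection is an assumption of this sketch (the text does not say how the estimates were obtained); it relies on the ratio decreasing monotonically with $\beta$, which holds for the values checked below but is not shown in the text.

```python
import math

def log_moment_ratio(k, beta):
    """ln M_k for N(x) ~ exp[-(x/x0)**beta]; the scale x0 cancels."""
    return (math.lgamma((k + 1) / beta)
            + (k - 1) * math.lgamma(1 / beta)
            - k * math.lgamma(2 / beta))

def moment_ratio(k, beta):
    """M_k = <x^k> / <x>^k from the gamma function formula."""
    return math.exp(log_moment_ratio(k, beta))

def beta_from_ratio(k, m_k, lo=0.05, hi=5.0, tol=1e-10):
    """Solve moment_ratio(k, beta) = m_k for beta by bisection.

    Assumes the ratio decreases with beta on [lo, hi] and that the
    solution lies inside this bracket.
    """
    target = math.log(m_k)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_moment_ratio(k, mid) > target:
            lo = mid   # ratio still too large, so beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a pure exponential ($\beta = 1$) the formula reduces to $M_k = k!$, which gives a quick consistency check. Applying `beta_from_ratio` to measured ratios for $k = 2, \ldots, 6$ would yield one estimate of $\beta$ per moment, and a spread among them signals that a single stretched exponential does not describe the data.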

More fundamentally, it is natural to expect different 
underlying mechanisms and different statistical features 
between minimally-cited and heavily-cited papers. The 
former are typically referenced by the author and close 
associates, and such papers are typically forgotten a short 
time after publication. Evidence for such a short lifetime 
of minimally-cited papers can be found, e.g., by compar- 
ing the small-citation tail of N(x) for the first 4 years 
(1975-79) and the last 4 years (1990-1994) of the PRD
data set. For $x < 200$, these data (appropriately normalized)
and the complete PRD data are virtually identical.
On the other hand, well-cited papers become known
through collective effects and their impact also extends 
over long time periods. This is reflected in the signif- 
icant differences among the large-citation tails of N(x) 
for papers of different eras. 

To help expose these differences in the citation distribution,
it is useful to construct a Zipf plot [6], in which
the number of citations of the $k$th-ranked paper
out of an ensemble of $M$ papers is plotted versus rank $k$
(Fig. 2). By its very definition (see Eq. (2)), the Zipf plot
is closely related to the cumulative large-$x$ tail of the citation
distribution. This plot is therefore well-suited for
determining the large-$x$ tail of the citation distribution.
The integral nature of the Zipf plot also smooths the
fluctuations in the high-citation tail and thus facilitates
quantitative analysis.
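
As a minimal sketch of the construction just described, the following builds the rank-ordered list from raw citation counts. The function names and the six-paper input are hypothetical. The second function divides the rank by the ensemble size $M$ and the counts by the mean number of citations, a normalization of the kind used for Fig. 2(b) to compare ensembles of different size.

```python
def zipf_plot(citations):
    """Return (rank k, citation count Y_k) pairs, rank 1 = most cited."""
    ordered = sorted(citations, reverse=True)
    return list(enumerate(ordered, start=1))

def scaled_zipf_plot(citations):
    """Same plot with rank divided by M and counts divided by the mean."""
    m = len(citations)
    mean = sum(citations) / m
    return [(k / m, y / mean) for k, y in zipf_plot(citations)]

# Hypothetical citation counts for a six-paper ensemble (not real data).
counts = [3, 0, 41, 7, 1, 8]
ranked = zipf_plot(counts)
scaled = scaled_zipf_plot(counts)
```

Because the list is sorted, the paper of rank $k$ has at least as many citations as every paper ranked below it, which is the cumulative property that ties the Zipf plot to the tail of $N(x)$.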

Given an ensemble of $M$ publications and the corresponding
number of citations for each of these papers in
rank order, $Y_1 \ge Y_2 \ge \ldots \ge Y_M$, then the number of citations
of the $k$th most-cited paper, $Y_k$, may be estimated
by the criterion [7]

$$\int_{Y_k}^{\infty} N(x)\,dx = k. \tag{2}$$

This specifies that there are $k$ publications out of the ensemble
of $M$ which are cited at least $Y_k$ times. Eq. (2)
also represents a one-to-one correspondence between the
Zipf plot and the citation distribution. From the dependence
of $Y_k$ on $k$ in a Zipf plot, one can test whether it
accords with a hypothesized form for $N(x)$.
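
To illustrate such a test for one hypothesized form, suppose the tail is a power law, $N(x) \sim x^{-\alpha}$ with $\alpha > 1$. Eq. (2) then gives $k \propto Y_k^{1-\alpha}$, so that $Y_k \propto k^{-1/(\alpha-1)}$: a straight line of slope $-1/(\alpha-1)$ on a double logarithmic Zipf plot. The sketch below checks this on synthetic data. The function names, the fixed random seed, the sample size and the fitted rank window are all assumptions of the example, not choices made in the text.

```python
import math
import random

def zipf_exponent(citations, k_min, k_max):
    """Least-squares slope of ln Y_k versus ln k for k_min <= k <= k_max."""
    ordered = sorted(citations, reverse=True)
    pts = [(math.log(k), math.log(y))
           for k, y in enumerate(ordered, start=1)
           if k_min <= k <= k_max and y > 0]
    n = len(pts)
    u_mean = sum(u for u, _ in pts) / n
    v_mean = sum(v for _, v in pts) / n
    suu = sum((u - u_mean) ** 2 for u, _ in pts)
    suv = sum((u - u_mean) * (v - v_mean) for u, v in pts)
    return suv / suu

def alpha_from_zipf(slope):
    """Tail exponent implied by a Zipf slope: slope = -1/(alpha - 1)."""
    return 1.0 - 1.0 / slope

# Synthetic sample whose density decays as x**-3 (Pareto shape 2).
rng = random.Random(0)
sample = [rng.paretovariate(2.0) for _ in range(200_000)]
slope = zipf_exponent(sample, 10, 10_000)
alpha = alpha_from_zipf(slope)
```

The fitted slope should come out close to $-1/2$ for this sample, and `alpha_from_zipf` converts any measured Zipf slope into the corresponding estimate of $\alpha$. The very top ranks are excluded from the fit here because single order statistics fluctuate strongly.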

In Fig. 2(a), a Zipf plot of the rank-ordered citation 
data is presented on a double logarithmic scale for 4 data 
sets: (a) ISI data (top 200,000 papers only), (b) complete 
PRD data (24,296 papers), (c) first 4 years of PRD data, 
vols. 11-18 (5044 papers), and (d) last 4 years of PRD 
data, vols. 43-50 (5467 papers). As alluded to previously, 
there is a considerable difference between the first and 
last 4 years of the PRD data. As might be anticipated, 
the more recent highly-cited papers (up to approximately 
rank 700) are cited less than papers in the earlier sub- 
data. (There are two exceptions, however. These are 
the two top papers in the first 4 years which are cited 
1741 and 1294 times, while in the last 4 years of data the 
two leading papers are cited 2026 and 1420 times.) The 
larger citation count of heavily-cited older papers reflects 
the obvious fact that popular but recent PRD papers are 
still relatively early in their citation history. This is in 
sharp contrast to poorly-cited papers where there is little 
difference in the citation count from the first 4 years and 
the last 4 years of the PRD data.

FIG. 2. (a) Zipf plot of the number of citations of the
$k$th-ranked paper $Y_k$ versus rank $k$ on a double logarithmic
scale for (a) ISI data, (b) PRD data, (c) vols. 11-18 of PRD,
and (d) vols. 43-50 of PRD. (b) The data of (a) in scaled units.
For visual reference, a straight line of slope $-1/2$ is also shown.

To interpret the apparently unsystematic data in the
Zipf plot of Fig. 2(a) effectively, it is instructive to scale
the data. Since $k$ ranges between 1 and the number of
publications in the ensemble, it is natural to define a
scaled relative rank $k/M_i$, where $M_i$ is the total number
of papers in each of the 4 data sets in Fig. 2 ($i = 1$, 2,
3, or 4). Similarly, for the ordinate, it is useful to define
a scaled citation count for the $k$th most-cited paper by
$Y_k/\langle x\rangle_i$, where $\langle x\rangle_i$ is the average number of citations for
all papers in the $i$th data set. As shown in Fig. 2(b), there
is relatively good collapse of the 4 data sets onto a single
universal curve. Notice also that the disparity in the two
PRD data subsets appears as a relatively small fluctuation
about a mean value. The data collapse also provides
a strong clue about the location of the asymptotic regime
for citation data.

Of particular relevance for the citation distribution,
this scaling plot indicates that the ISI data extends
deeper than the PRD data into the asymptotic tail of
most heavily-cited papers. This arises both from the
fact that the ISI data involves approximately 30 times
more publications than the PRD data and that the average
number of citations to ISI papers is approximately
one-half that of PRD papers. From the available data,
the highly-ranked ISI publications therefore provide the
best representation of the asymptotic tail of the citation
distribution. For ISI publications between rank 1 (8904
citations) and 12,000 (approximately 85 citations), the
data is fairly linear and a least-squares data fit in this
range yields an exponent for the Zipf plot of Fig. 2(b) of
approximately $-0.48$. By inverting Eq. (2), this power
law is equivalent to the distribution of citations also having
a power law form $N(x) \propto x^{-\alpha}$ for large $x$, with
$\alpha = 1 + 1/0.48 \approx 3.08$.

This power-law behavior suggests a reconsideration of
the citation distribution in this range of more than 85 citations
(Fig. 1(a)). The curvature in the data decreases significantly
for $x > 85$ and it is not unreasonable to attempt
a power law fit, but with the additional caveat that the
citation data beyond approximately 500 citations is dominated
by fluctuations. Consequently there is subjectivity
in specifying the range over which this fit is performed.
Least-squares fits to the data within $50 < x < 1000$ give
exponent estimates in the range 2.6 to 2.8, and a fit for
$85 < x < 500$, where the data are visually the most linear,
both in the citation distribution and in the Zipf plot,
gives $\alpha \approx 2.7$. The correspondence between these fits
and those in the Zipf plot therefore suggests that the citation
distribution may have a power-law tail, $N(x) \sim x^{-\alpha}$,
with exponent $\alpha$ close to 3.


TABLE I. Annual citation data from PRD as of June 1997
including the first three moments of the citation distribution
and the citation count of the most-cited paper.

Year   # articles   $\langle x\rangle$   $\langle x^2\rangle^{1/2}$   $\langle x^3\rangle^{1/3}$   $x_{\max}$
1975   1369         19.3                 80.0                          168.8                         1294
1976   1085         17.8                 71.8                          178.5                         1741
1977   1328         16.9                 61.5                          126.6                         846
1978   1262         16.5                 56.6                          123.2                         1066
1979   1229         17.8                 57.5                          114.7                         907
1980   1114         18.5                 61.9                          126.2                         912
1981   1107         17.0                 62.5                          148.8                         1449
1982   1116         13.0                 32.4                          56.7                          340
1983   1100         14.6                 51.1                          107.7                         813
1984   1090         14.4                 46.9                          107.6                         1004
1985   1094         13.8                 34.6                          66.7                          579
1986   1222         11.8                 24.7                          38.9                          215
1987   1275         12.7                 35.6                          79.5                          772
1988   1124         11.4                 23.6                          38.8                          244
1989   1153         12.0                 29.8                          62.6                          600
1990   1161         13.0                 29.7                          57.5                          515
1991   1083         13.2                 34.5                          74.9                          622
1992   1388         13.3                 47.3                          130.0                         1420
1993   1436         11.5                 23.0                          40.0                          311
1994   1560         11.9                 55.9                          175.4                         2026

Another important aspect of citation statistics which
emerges from Fig. 2(a) is its continuing temporal evolution.
This feature is nicely illustrated by the annual citation
statistics of PRD publications, where the average
number of citations $\langle x\rangle$ for articles published in a given
year is typically decreasing slowly with time (Table I). It
is interesting that the existence of a single exceptionally
well-cited paper in a particular year has an imperceptible
effect on $\langle x\rangle$ but a much larger influence on higher-order
moments. Notice also that the total number of citations
to PRD papers published in a given year, even as far
back as 1975 (the first year for which data is available),
is slowly increasing. Since papers from this period which
are still currently being cited are also likely to be highly
cited, this implies that the large-$x$ tail of the citation
distribution has not yet reached its final state. Because of
this continuing evolution of the citation distribution, one
cannot expect that the properties of the high-citation tail
of the citation distribution will be accurately determined
by direct analysis.
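
The different sensitivity of the moments to one outstanding paper can be made concrete with the 1994 entries of Table I. The sketch assumes that the three tabulated moment columns are $\langle x\rangle$, $\langle x^2\rangle^{1/2}$ and $\langle x^3\rangle^{1/3}$, an interpretation suggested by their magnitudes rather than stated explicitly; the function name `top_paper_share` is hypothetical.

```python
def top_paper_share(n_papers, x_max, moment_roots):
    """Share of each raw moment <x^k> contributed by the single top paper.

    `moment_roots` holds the tabulated values <x^k>**(1/k) for k = 1, 2, 3.
    A paper with x_max citations adds x_max**k / n_papers to <x^k>.
    """
    shares = []
    for k, root in enumerate(moment_roots, start=1):
        shares.append((x_max ** k / n_papers) / root ** k)
    return shares

# 1994 row: 1560 articles, moments 11.9, 55.9, 175.4, top paper 2026 citations.
shares_1994 = top_paper_share(1560, 2026, [11.9, 55.9, 175.4])
```

Under this reading, the single most-cited 1994 paper accounts for roughly a tenth of the mean, over four-fifths of the second moment, and nearly all of the third moment.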

In summary, the citation distribution provides basic
insights about the relative popularity of scientific publications
and provides a much more complete measure
of popularity than the average or total number of citations.
At a basic level, most publications are minimally
recognized, with $\approx 47\%$ of the papers in the ISI data
set uncited, more than 80% cited 10 times or less, and
$\approx 0.01\%$ cited more than 1000 times. The distribution
of citations is a rapidly decreasing function of citation
count but does not appear to be described by a single
function over the entire range of this variable. Although
the available data is extensive, it still appears insufficient
to quantify the tail of the citation distribution unambiguously
by direct means. However, a Zipf plot of the
citation count of a given paper versus its citation rank
indicates a substantial range of power law behavior with
exponent close to $-1/2$. This provides indirect evidence
that the citation distribution has a power law asymptotic
tail, $N(x) \sim x^{-\alpha}$ with $\alpha \approx 3$. This differs from the
conclusion of Ref. [4], where the citation distribution of
individual authors was argued to have a stretched exponential
tail.
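
The percentages quoted for the ISI data set follow directly from the counts given earlier (368,110 uncited papers, 633,391 cited 10 times or less, and 64 cited more than 1000 times, out of 783,339). A short check, with the helper name `percent` chosen only for illustration:

```python
TOTAL_ISI = 783_339  # papers in the ISI data set

def percent(count, total=TOTAL_ISI):
    """Express a paper count as a percentage of the ISI data set."""
    return 100.0 * count / total

uncited = percent(368_110)       # papers never cited
ten_or_less = percent(633_391)   # papers cited 10 times or less
over_1000 = percent(64)          # papers cited more than 1000 times
```

The last figure is slightly below 0.01%, so the value quoted in the summary is a rounding.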

Another important aspect of citations is that computerized
data are relatively recent, and the PRD data indicates
that citation statistics from 1975 are still evolving.
Thus even the more extensive ISI data set is still
too recent to provide an accurate picture of the long-time
and large-citation tail of the citation distribution.
It should therefore be worthwhile to study the properties
of older citation data. Alternatively, information of
a related genre, such as the distribution of sales for a
particular class of books, or ticket sales for movies and
theaters, may provide useful data for studying citation-related
statistics.

Finally, the citation distribution provides an appealing 
venue for theoretical modeling. There are several quali- 
tative features about citations which should be essential 
ingredients for a theory of their distribution. Since al- 
most all papers are gradually forgotten, the probability 
that a given paper is cited should decrease in time with 
a relatively short memory. Conversely, a paper which is 
in the process of becoming recognized gains increasing 
attention through citations. This suggests that the prob- 
ability of a paper being cited at a given time should be 
an increasing function of the relative number of citations 
to that paper from an earlier time period. Work is in 
progress to construct a model for the citation distribution
which is based on these considerations.

[1] See e.g., Science Citation Index Journal Citation Reports
    (Institute for Scientific Information, Philadelphia) for annual
    lists of top-cited journals and articles (web site:
    http://www.isinet.com/welcome.html).
[2] For example, current lists of top-cited articles in
    high-energy physics are maintained by the SPIRES
    High-Energy Physics Database at SLAC (web site
    http://www.slac.stanford.edu/find/top40.html).
[3] W. Shockley, Proc. IRE 45, 279 (1957).
[4] J. Laherrere and D. Sornette, cond-mat/9801293.
[5] The PRD data was provided by H. Galic from the SPIRES
    Database. The ISI data was provided by D. Pendlebury
    and H. Small of the Institute for Scientific Information.
    These two data sets and related citation data are available
    from my web site http://physics.bu.edu/~redner
[6] G. K. Zipf, Human Behavior and the Principle of Least
    Effort (Addison-Wesley, Cambridge, 1949).
[7] This is a basic exercise in extreme value statistics. See e.g.,
    J. Galambos, The Asymptotic Theory of Extreme Order
    Statistics (J. Wiley & Sons, New York, 1978).